This file describes the preliminary analyses of three test-concepts in the QLVLnewscorpora: penis, inleiding & hart. The concepts were selected from the full list of concepts (N = 433) that I collected from WordNet, Van Dale and DLP2. Information about the full set of concepts is available here.
At this moment, parameter selection is based on observations in Mariana's analyses of nouns & verbs, as well as on comments in the parameters google doc. The following parameter settings were used to construct the token models:
| parameter name | FOC | SOC |
|---|---|---|
| Definition target type | lemma/pos | lemma/pos |
| Window size | fixed: 10 | fixed: 4 |
| Boundaries | sentence/none | none |
| cw selection: strategy | local/global | global |
| cw selection: settings | local: nav with freq > 200, collfreq = 3, ppmi > 1, llr None or > 1; global: nav top-5000 | nav top-5000 |
| Weighting | ppmi | none |
Of these, I plan to vary the boundaries (default: sentence) and the context word selection settings for FOCs. Specifically, I will compare implementing an LLR filter or not within the "local"¹ strategy, as well as a local versus a global² strategy. In the latter case, all top-5000 nav context words will be considered.
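To make the "local" selection settings concrete, here is a minimal sketch of how the collfreq and ppmi thresholds interact. The co-occurrence counts are invented for illustration (the real counts come from the corpus), and the analysis itself was not done in Python:

```python
import math
from collections import Counter

# Toy co-occurrence counts (invented for illustration): (target, context) -> freq.
cooc = Counter({
    ("fluit/noun", "muziek/noun"): 6,
    ("fluit/noun", "groot/adj"): 1,
    ("penis/noun", "lichaam/noun"): 8,
    ("penis/noun", "groot/adj"): 3,
})

total = sum(cooc.values())
tfreq = Counter()  # marginal target frequencies
cfreq = Counter()  # marginal context frequencies
for (t, c), n in cooc.items():
    tfreq[t] += n
    cfreq[c] += n

def ppmi(t, c):
    """Positive pointwise mutual information for a target/context pair."""
    if cooc[(t, c)] == 0:
        return 0.0
    pmi = math.log2((cooc[(t, c)] * total) / (tfreq[t] * cfreq[c]))
    return max(0.0, pmi)

# Mimic the "local" selection: keep context words with collfreq >= 3 and ppmi > 1.
selected = {(t, c) for (t, c), n in cooc.items() if n >= 3 and ppmi(t, c) > 1}
```

In this toy example only muziek/noun survives both filters for fluit/noun: the other pairs either fall below the collocation-frequency threshold or have a PPMI at or below 1.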
This concept was selected because it's a difficult one, with many variables (N = 17, excluding constructions) and varying frequencies per variable.
| variant | frequency |
|---|---|
| ding/noun | 80601 |
| fluit/noun | 1447 |
| jongeheer/noun | 105 |
| lid/noun | 107912 |
| lul/noun | 1155 |
| mannelijkheid/noun | 459 |
| penis/noun | 1252 |
| piemel/noun | 372 |
| pik/noun | 451 |
| pisser/noun | 4 |
| plasser/noun | 18 |
| potlood/noun | 1504 |
| sjarel/noun | 6 |
| snikkel/noun | 18 |
| speer/noun | 1217 |
| tampeloeres/noun | 1 |
| zwengel/noun | 42 |
This causes two problems for the models & analysis:
A possible solution for the latter problem is to only sample the relevant tokens for the highly frequent types. This can be done in two ways:
To find a way of extracting context words for the problematic variants, we need a token model for the non-problematic ones that performs well. The best model would be one that (1) fits the data well (to avoid artificial effects, e.g. regional differences) and (2) has a (relatively) clear semantic region (or branch) where most observations for the target concept are located (precision), while out-of-concept tokens are located elsewhere (recall). As in other studies in the NephoSem project, determining which model is best is not straightforward. There are a number of procedures that can be considered:
So far, eight solutions with t-SNE clustering and three (four) models with NMDS have been constructed. All the token models have the following parameters:
| parameter name | FOC | SOC |
|---|---|---|
| Definition target type | lemma/pos | lemma/pos |
| Window size | fixed: 10 | fixed: 4 |
| Boundaries | sentence | none |
| cw selection: strategy | local | global |
| cw selection: settings | local: nav with freq > 200, collfreq = 3, ppmi > 1, llr None | nav top-5000 |
| Weighting | ppmi | none |
You can find a Shiny app to explore the models that have been analyzed so far here.
The t-SNE-solutions additionally vary according to two parameters:
Overall, the more stable models appear to be the ones with more runs and perplexity 30. Models with very low perplexity (N = 10) seem to have too many small clusters. Choosing settings other than 'lemma' for the colours in the model plot shows that none of the lectal variables in the data seem to play a role.
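The perplexity comparison above can be sketched as follows. This is a Python illustration with random toy data standing in for the token-by-context-word matrix (the actual models were built elsewhere), assuming scikit-learn is available; the number of optimization iterations ("runs") could be varied in the same way:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-in for a token-by-context-word matrix (60 tokens, 20 dimensions).
tokens = rng.normal(size=(60, 20))

# Compare a very low and a moderate perplexity, as in the models above.
# Low perplexity emphasizes very local structure, which tends to produce
# many small clusters; higher perplexity yields more stable global layouts.
embeddings = {
    p: TSNE(n_components=2, perplexity=p, init="pca",
            random_state=0).fit_transform(tokens)
    for p in (10, 30)
}
```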
I have tried four NMDS-solutions so far:
The first NMDS solution is really bad, with a high stress value (> 0.28), and it did not converge. The second solution ran for over twelve hours and was only at trial 178, with stress values comparable to the first solution (at this point, I killed the process). The third and fourth solutions are the best ones so far, with in both cases a stress value of 0.1334 (for the same trial), but still no convergence. The second dimension may be the one we're after, but it is not the case that all variants with the target meaning are at the bottom of the plot, nor that all out-of-concept variants are at the top.
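For reference, a minimal non-metric MDS run looks like this. This is a Python sketch on toy dissimilarities (the actual models used R's NMDS implementation), assuming scikit-learn is available; n_init is the number of random starts, the parameter behind the convergence problems noted above:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
points = rng.normal(size=(40, 5))
# Pairwise Euclidean dissimilarities between the toy "tokens".
diss = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))

# Non-metric MDS; n_init random starts. Note: scikit-learn reports raw
# stress, not Kruskal's stress-1 as R's vegan/monoMDS does, so the values
# are not directly comparable to the stress values quoted above.
nmds = MDS(n_components=2, metric=False, n_init=4,
           dissimilarity="precomputed", random_state=0)
embedding = nmds.fit_transform(diss)
```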
Since we're running into problems with the NMDS models, I analyzed where the problematic tokens are located. I used goodness() from library(vegan) to obtain a goodness-of-fit value per token:
> goodness() finds a goodness of fit statistic for observations (points). This is defined so that the sum of squared values is equal to squared stress. Large values indicate poor fit.
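The quoted property can be illustrated directly: if each token's goodness value is its share of the squared residuals (scaled by the stress denominator), the squared values sum exactly to the squared stress. A Python sketch with toy data (this is not vegan's actual monoMDS computation, which also handles the monotone transformation and ties):

```python
import numpy as np

def pairwise(x):
    """Euclidean distance matrix for a set of points."""
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(2)
config = rng.normal(size=(30, 2))        # a toy NMDS configuration
d_emb = pairwise(config)                 # distances in the ordination
# Noisy "observed" dissimilarities, kept symmetric with a zero diagonal.
d_obs = d_emb + rng.normal(scale=0.05, size=d_emb.shape)
d_obs = (d_obs + d_obs.T) / 2
np.fill_diagonal(d_obs, 0)

resid2 = (d_obs - d_emb) ** 2
denom = (d_emb ** 2).sum() / 2            # sum over unique pairs
stress = np.sqrt(resid2.sum() / 2 / denom)  # Kruskal stress-1 style

# Per-token goodness: each token's share of the squared residuals,
# scaled so that the squared values sum to the squared stress.
goodness = np.sqrt(resid2.sum(axis=1) / 2 / denom)
```

A large value for a token means its pairwise dissimilarities are poorly reproduced in the low-dimensional configuration.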
This plot shows the results for the fourth NMDS solution. The less problematic tokens (lighter colours) are located at the top left of the plot, where the observations for fluit and potlood are located (typically with their prototypical meaning), together with the tokens for penis in the middle. Perhaps the model has more trouble with variants that are more polysemous.
Finally, I also used agglomerative hierarchical clustering (Ward's method) to analyze these data. Rather than choosing a number of clusters beforehand, I considered between 2 and 50 clusters, basing the optimal number of clusters on the silhouette width of the clusters. The optimal number of clusters in these data is 45 (sw = 0.358), with solutions that have 15 clusters or more reaching acceptable results (sw > 0.2). Obviously, solutions with 15 or more clusters are difficult to interpret, but for the purpose of illustration, this plot shows the solution with 15 clusters (isolate one cluster by double-clicking on its symbol in the legend). The x- and y-axes show the results from the t-SNE solution with perplexity = 30 and 5000 runs. Some of the clusters make a lot of sense, especially when they are also separated by t-SNE (e.g. the ouwe lul cluster at the right of the plot in magenta). Others are more diverse (e.g. clusters 2 and 5).
With fewer clusters, only some of the clearer divisions are (obviously) retained. Cluster 3 in the solution above, for instance, has body parts as context words. In the solution below, it is merged into the more diverse cluster 2.
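The select-by-silhouette procedure can be sketched as follows. This is a Python illustration on three well-separated toy clouds (the actual analysis ran in R over k = 2..50 on the token coordinates), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Three well-separated toy "token clouds" standing in for t-SNE coordinates.
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(25, 2)) for c in (0, 5, 10)])

# Ward clustering for a range of k; keep the k with the highest
# average silhouette width.
scores = {}
for k in range(2, 11):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(data)
    scores[k] = silhouette_score(data, labels)
best_k = max(scores, key=scores.get)
```

On these clearly separated clouds the silhouette width peaks at the true number of clouds; on the real token coordinates the peak is much flatter, which is why many k values reach comparable sw values.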
Defined in the google doc as: "potentially all words within the specified window span around the target token". Note that my definition of "local" is not extreme, as I am only including nav's with a frequency of > 200. However, it is local in the sense that potentially all these words can be considered (N = 37807)↩
Defined in the google doc as "fixed set of context words, same for all target types". Here, the 5000 most frequent nav's.↩
This may also be related to the parameter settings that are used, e.g. if no nav's of frequency > 200 occur with the target type in a particular token/observation, this type is not included in the model.↩
Note that while it may be dangerous to use this strategy, it doesn't have to be. We just don't know yet.↩
An alternative strategy may be to semasiologically analyze these variants. Specifically for ding this could be a fruitful approach, because this variant is highly polysemous and is also included as a high-level word in the WordNet taxonomies. It is not known whether the penis-meaning of ding would show up in such an analysis.↩
We could select high-frequency candidates from the association data of Gert Storms for this purpose.↩
This determines how many times the algorithm can try to find a stable solution. If it doesn't succeed in the specified number of random starts, there is no successful convergence.↩